Genome wide association study of Escherichia coli bloodstream infection isolates identifies genetic determinants for the portal of entry but not fatal outcome

Escherichia coli is an important cause of bloodstream infections (BSI), which is of concern given its high mortality and increasing worldwide prevalence. Finding bacterial genetic variants that might contribute to patient death is of interest to better understand infection progression and implement diagnostic methods that specifically look for those factors. E. coli samples isolated from patients with BSI are an ideal dataset to systematically search for those variants, as long as the influence of host factors such as comorbidities are taken into account. Here we performed a genome-wide association study (GWAS) using data from 912 patients with E. coli BSI from hospitals in Paris, France. We looked for associations between bacterial genetic variants and three patient outcomes (death at 28 days, septic shock and admission to intensive care unit), as well as two portals of entry (urinary and digestive tract), using various clinical variables from each patient to account for host factors. We did not find any association between genetic variants and patient outcomes, potentially confirming the strong influence of host factors in influencing the course of BSI; we however found a strong association between the papGII operon and entrance of E. coli through the urinary tract, which demonstrates the power of bacterial GWAS when applied to actual clinical data. Despite the lack of associations between E. coli genetic variants and patient outcomes, we estimate that increasing the sample size by one order of magnitude could lead to the discovery of some putative causal variants. Given the wide adoption of bacterial genome sequencing of clinical isolates, such sample sizes may be soon available.


Introduction
Escherichia coli bloodstream infections (BSI) represent an increasing public health burden as (i) they exhibit high mortality (between 10 and 30%) [1,2], (ii) its worldwide prevalence is increasing since the 2000s [3], and (iii) antimicrobial resistance is rising in E. coli [3], which could impact patients' management and infection outcome. Molecular epidemiology of BSI has been refined in the last few years thanks to whole genome sequencing. E. coli has a clonal population structure [4] with the delineation of at least eight phylogroups (A, B1, B2, C, D, E, F and G) [5]. Strains responsible for BSI belong mainly to a few clonal lineages including sequence types (ST) ST131, ST73, ST95, ST69, and all of the B2 and D phylogroups [5]. Until now, classical multivariate analyses have identified host factors and portal of entry as the major determinants of a patient's death, while bacterial genetic traits have been associated with a smaller effect size to mortality or only in a subset of studies [1,2,[6][7][8][9][10].
Bacterial genome wide association studies (GWAS) are now common thanks to an increase in sequencing capacity and specific computational tools [11]; in E. coli they have allowed the identification of genetic traits linked to pathogenicity in avian strains [12], invasiveness in urinary tract infection (UTI) strains [13], and isolation source [14]. However, they failed to identify genetic markers of disease severity in Shigellosis [15]. This could have multiple explanations including small sample sizes that can lead to insufficient power to find causal variants. Disease severity (e.g. patient death) is a trait that is not under selection as it doesn't provide a reproductive advantage, and is therefore less likely to evolve independently across multiple lineages, which in turn makes it less likely to be found through bacterial GWAS [16]. Furthermore, as opposed to antimicrobial resistance which is often caused by a handful of genetic variants, disease severity might involve multiple genetic loci, each with small effects, which are harder to discover.
Identifying microbial genetic elements that contribute to the outcome of BSI is of interest to i) better understand the molecular mechanisms of microbial infection, and ii) improve patient care and prediction of clinically-relevant bacterial traits based on microbial genomics data, which is increasingly becoming available with very low turnaround time [17]. In this context, we performed GWAS on data from two large clinical observational prospective multicentric studies from the Paris area (Septicoli [10] and Colibafi [8]) involving a total of 912 adult patients with E. coli BSI. We used the clinical information from each patient, such as age, comorbidities and treatment as covariates to reduce the influence of host factors in the association analysis [18]. We then performed a power analysis using simulated genotypes and phenotypes to understand which sample size would be appropriate to reach an ever higher statistical power. As microbial whole genome sequencing costs keep reducing we argue that a 10-fold increase in sample size for this kind of studies is likely to be available in the near future.

A combined dataset of 912 BSI patients with matching clinical data and bacterial isolates whole genomes
In this study we combined data from two similar clinical studies (Colibafi [8] and Septicoli [10]), conducted across 11 teaching hospitals, belonging to the same institution, the "Assistance Publique-Hôpitaux de Paris" (AP-HP), across and around Paris, France. The earlier study (Colibafi, 2005) originally included 1,051 patients across the whole of France, with information about bacterial genetic determinants obtained through PCR molecular assays; in this study we kept only those 365 samples originating from 8 hospitals from the Paris area to avoid geographical biases, due for instance to the specialization of some hospitals in the treatment of specific pathologies. From the later study (Septicoli, 2016-7) we kept all the 545 samples from 7 hospitals in the Paris area. Bacterial genomes of these samples from both studies, generated by Illumina technology, were available [19].
We focused on three outcomes for the patients represented in the combined dataset, namely death at 28 days, presence of a septic shock and admission to an intensive care unit (ICU); we note that these outcomes are not mutually exclusive. The prevalence of these outcomes in the two studies was 10.7%, 24% and 14.6%, for death, septic shock and admission to ICU, respectively (S1 Fig). The prevalence of death and admission to ICU was very similar between the two studies, with 12.5% and 9.5% of deaths in the Colibafi and Septicoli studies, respectively, and admission to the ICU reported for 12.5% and 16% of patients. On the other hand, we found a much higher incidence of patients experiencing septic shock in the Septicoli cohort as opposed to Colibafi: 32.5% of patients versus 11.4%. These variations may be due to the different hospitals contributing the clinical data between the two studies: indeed, even though both studies are exclusively focused on the AP-HP teaching hospitals in Paris, only 4 out of 11 hospitals are included in both studies [19]. We additionally focused on the reported portal of entry of the BSI, which has been previously found to be predictive of patient outcome; the urinary and digestive tract portals of entry were the most prevalent in the combined dataset-58.2% and 35.6% of patients, respectively. The other reported portals of entry all had a prevalence below 5%, and we therefore chose to only use the urinary and digestive tract portals of entry for all subsequent analyses. We found that entry through the digestive tract was reported for 41.8% of the patients in the Septicoli study, compared to 26.6% in the Colibafi study, which again may be due to differences in the hospitals providing the data for both studies. The age distribution between the two studies is comparable, with median age of the patients being 67 and 69 years in the Colibafi and Septicoli studies, respectively (S1 Fig). To reduce the influence of these differences between the two studies on our analyses, we introduced the study provenance as a covariant in the combined dataset (S1 Table).

The pathogen portal of entry is associated with BSI outcomes
We found that several clinical variables are associated with the three patient outcomes, consistent with earlier analyses on the two studies alone [8,10] (Table 1). Among other variables, entry through the pulmonary and digestive tract were associated with death (odds ratio 2.88 and 1.51 and p-values 8E -5 and 0.006, respectively), while entry through the urinary tract was found to be negatively associated (odds ratio 0.51, p-value 2E -5 ). Entry through the pulmonary tract was also associated with septic shock (odds ratio 2.12, p-value 4E -3 ), while entry through the digestive tract was associated with patients being admitted to the ICU, among other variables (odds ratio 1.53, p-value 1E -3 ). When combining all clinical variables with association pvalue < 0.1 into a multivariate analysis ( Table 2, see Materials and methods) we found that portal of entry was again the dominant variable associated with patient outcomes, together with study provenance. In particular entry through the pulmonary tract was significantly associated with a patient's death (odds ratio 2.40, p-value 0.003), while entry through the urinary tract was negatively associated (odds ratio 0.64, p-value 0.008). Entry through the pulmonary tract was also significantly associated with a patient experiencing a septic shock (odds ratio 2.10, p-value 0.005), while entry through the digestive tract was associated with a patient being admitted to the ICU (odds ratio 1.57, p-value 0.002). This analysis underscores the influence of the E. coli portal of entry on BSI outcomes; if the portal of entry was influenced by a bacterial genetic variant, we could expect to find it through a statistical association.

The pathogen phylogroup is associated with the portal of entry but not with BSI outcomes
We found that no E. coli phylogroup was associated with patient death (p-value > 0.01), consistent with earlier analyses from the two separate studies [8,10], in contrast to what we previously observed in a mouse model of BSI, in which we found that the B2 phylogroup was associated with the death of the animal [20]. We also observed no association between an isolate's phylogroup and a septic shock or admission to the ICU. The absence of these genetic

PLOS GENETICS
Genetic determinants of E. coli bloodstream infections background effects does not imply that there are no "locus effects", meaning that individual genetic variants may still be found to be associated with patients' outcomes. On the other hand, we found a strong association between the isolates' phylogroup and the urinary and digestive tract portals of entry; phylogroup B2 was associated with the urinary tract, while phylogroups A and, B1 were associated with the digestive tract (p-value < 0.01, S2 Table). Such similarity between the two phenotypes is not surprising given the low prevalence of the other portals of entry, leading to two almost mutually exclusive traits (Fig 1).

Bacterial genetic factors can explain a significant fraction of the variation of the route of infection
We used narrow-sense heritability-the fraction of phenotypic variance that is explained by additive genetic effects [21]-to estimate whether we could expect to find bacterial genetic variants in association with the three patient outcomes or the two main portals of entry. Since we found that clinical variables and the pathogen's phylogroup are associated with our target variables, we measured heritability in three ways: using the phylogroup alone as a genetic effect [16], and using a kinship matrix generated from the whole genetic variation as encoded by unitigs, alone or conditioning the analysis with the clinical variables in order to account for confounding factors (Fig 2). We found that phylogroups could explain 10% of the variation for both the urinary and digestive tract infections (95% CI 0.01% -48.3% for both), but none for any of the three patient outcomes. Overall genetic variation could however explain 22% (95% CI 0% -95.3%) of the variation in admission to the ICU, which was negligibly reduced to 18% (95% CI 0% -96.4%) when considering clinical covariates. While this may seem to indicate that the pathogen genetic variation might influence whether a patient will eventually need intensive care, we noted that this relatively high heritability was present in the Septicoli cohort alone (Fig 2B and 2C). We didn't however find such a discrepancy between the two studies when we estimated the heritability for the portals of entry using either overall genetic variation alone or after conditioning. This indicates that there may be confounding factors that contribute to the decision to change a patient's treatment which vary between the two studies. This is unsurprising, as the decision to admit a patient to the ICU can depend on the subjective assessment of a physician considering a patient's comorbidities, as well as other subtle differences in care protocols. Conversely, the estimated heritability due to genetic effects for the portals of entry varies in magnitude between the combined dataset and the two cohorts alone, but we nonetheless found it to be > 0 in the three datasets. In particular, we found that genetic effects could explain 40% and 46% of the variance of urinary and digestive tract portals of entry, respectively (95% CI 0% -95.7% and 0% -95.8%, respectively), which is more than four times the variance explained by the isolates' phylogroup (Fig 2A). This suggests that a genome-wide association analysis is likely to discover genetic variants associated with the portal of entry for BSI. This relatively high fraction of the phenotypic variability explained by genetic effects is however reduced when conditioning it on other clinical variables (23% and 28% for the urinary and digestive tract portal of entry, respectively, 95% CI 0% -96.4% for both traits), which again underscores the influence of host characteristics in determining the establishment of bloodstream infections.

The papGII operon is associated with the pathogen's entry through the urinary tract
In order to account for both core and accessory genome genetic variability, which is one of the main differences between GWAS studies in human and bacterial datasets, we associated unitigs generated from a de Bruijn graph of all the bacterial isolates against the target variables [22,23]; namely the three patient outcomes and the two major portal of entry for BSI. We used a linear mixed model for the association, which has been shown to better correct for the influence of bacterial population structure in the association [24]. In order to account for the host and clinical factors on target variables, we conducted the association with the clinical variables

PLOS GENETICS
as covariates [25]. Since our earlier analysis indicated that the portal of entry can influence patient outcome, we added this information as covariates when looking for bacterial genetic factors associated with the three patient outcomes.
Consistent with the heritability estimates, we found few or no unitigs associated after multiple testing correction with either the death of the patient (none for both the naïve and conditioned association), the presence of septic shock (one for the naïve association and none for the conditioned association) and admission to ICU (one for both the naïve and conditioned association). Conversely, we found a larger number of unitigs to be associated with either portal of entry; 177 and 53 for the urinary and digestive tract, respectively, when running a naïve association, and a lower number when adding clinical covariates, with 88 unitigs passing the significance threshold for the urinary tract, and none for the digestive tract, respectively (Figs 3A, S2, and S3 Table). Finding an association between individual unitigs and a phenotype of interest may be due to chance, even after multiple testing correction and the inclusion of covariates [26]. To reduce the influence of these factors on the results of the associations, we conducted a stringent analysis when mapping the unitigs back to each bacterial isolate; briefly, we took steps to exclude those unitigs that are mapped to multiple genes across all strains or that are found in a low number of strains (see Materials and methods). After this stringent mapping step, we found no genes with associated unitigs mapped to them for the three patient outcomes and entry through the digestive tract, and 12 genes for the urinary tract portal of entry, independently on whether we used the clinical covariates in the unitig association step (Fig 3B, S4 Table and S1 Data). The absence of any associated gene with the three patients' outcomes is in agreement with the heritability estimates, and with our argument that the relatively high heritability for the admission to the ICU may be the result of confounders. We found similar effect sizes reported for all tested unitigs for the two portals of entry (Pearson's r for the odds-ratio -0.76), but with opposite signs, which is likely the result of the two traits being almost exactly mutually exclusive (S3 Fig). This similarity in the association results is also evident from the Manhattan plot of the two traits (Fig 3C), which shows peaks in the same region of E. coli IAI39 chromosome.
Among these 12 genes associated with urinary tract, six belonged to the pap operon or in its immediate vicinity; the genes in this operon encode for a type P pilus, which has been shown to interact with glycolipids present on uroepithelial cells and is therefore believed to be one of the main defining loci for severe UTI. We found that the papGII variant of the papG gene encoding for the adhesin part of the tip was associated with entry through the urinary tract (S4 Fig). The PapGII adhesin is mainly found in acute pyelonephritis and binds preferentially to Gb4 (GalNAcβ1-3Galα1-4Galβ1-4GlcCer), which is abundant in the upper urinary tract of humans [27]. We found another three genes associated with both portals of entry and encoded in the vicinity of the pap operon, all with high sequence similarity (blastp sequence identity > 95%) to genes annotated as phosphoethanolamine transferases, or opgE. This gene is involved in the biosynthesis of osmoregulated periplasmic glucans (OPGs), which in turn regulate motility and secretion of exopolysaccharides and are considered virulence factors for Gram-negative species [28][29][30][31]. We found these putative opgE genes encoded in the vicinity of phage-derived integrase genes (annotated as intA and intS). The putative opgE gene was encoded in the near vicinity of the pap operon (distance < 15kbp) in 118 strains, and an even shorter distance (< 10kbp) between the pap operon and the edge of its contig for those strains (201) in which the pap operon and the putative opgE gene were encoded in separate contigs (S3 Fig). We therefore concluded that both the putative opgE gene and the integrase genes are

PLOS GENETICS
Genetic determinants of E. coli bloodstream infections part of the same genetic island that may have been acquired through horizontal gene transfer across E. coli strains [13].
Since we observed a strong association between the portal of entry and phylogroup B2, we also ran an association that included only B2 strains and conditioned with the clinical variables (N = 492). Similarly to the analysis with the full dataset, we found one unitig associated with admission to ICU, 46 with entry through the urinary tract, and eight with entry through the digestive tract (S5 Fig and S3 Table), which we mapped to one, three and four genes, respectively (S4 Table). For both portals of entry we found papGII to be associated, as well as papH and papD for the urinary tract and digestive tract, respectively; as observed for the full dataset, the association sign is positive for the urinary tract and negative for the digestive tract. A gene annotated as mdoB was negatively associated with admission to ICU in this lineage specific analysis; this gene is also involved in the biosynthesis of osmoregulated periplasmic glucans (OPGs).

A larger sample size could reveal additional bacterial factors involved in BSI
Our heritability estimates and association results are in good agreement both with previous results about the difficulty of finding bacterial genetic elements associated with virulence from clinical cohorts [16] and with the importance of the pap operon in enabling severe UTI [13]. We next asked whether it would theoretically be possible to find even more associations from cohorts measuring E. coli BSI; would an increase in sample size lead to the discovery of more bacterial genetic factors able to affect the establishment and the outcome of BSI? To answer this question, we generated a dataset of 10,000 simulated genomes-one order of magnitude higher than the dataset presented in this study-with mutation and recombination rates similar to those of E. coli, and two phenotypes with either "high" or "low" heritability (0.2 and 0.05, respectively) [26]. For each phenotype we selected 28 causal variants with a range of effect sizes. We then ran a GWAS on the full dataset and in two smaller samples, in order to determine the empirical statistical power (Fig 4). In this simulated dataset an increase in sample size by an order of magnitude would be needed to discover most of the causal variants (mean recall 57%) for the phenotype with high heritability, which is a large increase from the sample size most similar to this study (1,000 samples, mean recall 5%). Conversely, we found that for the low heritability scenario only a relatively low statistical power (mean recall 10%) could be achieved with a large sample size of 10,000 samples, and no power when using 1,000 samples (mean recall 0). While this simulation cannot be directly compared with the genetics of complex bacterial phenotypes such as BSI caused by E. coli, it points to the theoretical possibility of further refining these results if a larger set of samples could be assembled. This could prove particularly fruitful if patient outcomes are indeed influenced at least partially by bacterial genetic factors.

Discussion
In this study we leveraged the clinical and genetic data of two very similar BSI clinical cohorts in order to test whether E. coli's genetic variation has an influence on the course of severe bloodstream infections. As opposed to our previous analysis using a well-controlled mouse model of sepsis where GWAS identified iron capture systems as main drivers of virulence [20], we did not find a clear locus effect for the three patient outcomes tested here. This is in agreement with a previous work in which we used 60 E. coli strains derived from bacteraemic patients and tested their virulence in the mouse model, looking for genetic determinants for the clinical severity of infection. Indeed, virulence based on an animal model was correlated with bacterial virulence determinants but not with pejorative clinical outcome of BSI [32]. In fact, the animal model is a controlled environment, as the individual tested are healthy and homogeneous (same sex, age, weight and diet) and the standardized inoculation uniformizes the portal of entry, thus allowing for an unbiased evaluation of the intrinsic virulence of each strain [33]. But an important limitation of the mouse sepsis model is that subcutaneous injection in the mouse neck does not reflect the pathophysiological processes of the portal of entry, thus skipping the first steps of BSI. The data presented here once more seems to point to either a negligible influence of bacterial genetic variation on infection outcomes when compared to host and clinical factors, a complex trait influenced by multiple loci, or to a lack of statistical power due to a relatively low sample size.
The results from the heritability analysis from this dataset of combined cohorts is mixed in this regard, as we found that the variance in a patient's death or septic shock is not explained by bacterial genetics, while we found that locus effects may explain up to 22% of the variance in admission to ICU. When we broke down this analysis in the two cohorts alone, we observed that this relatively high heritability is only observed in the Septicoli cohort. As the decision to change the care of a patient is a complex decision dependent on the subjective assessment of clinicians and other hospital-specific policies, we believe that this high estimate may be the result of confounders. A more objective measure of disease burden may therefore be needed in order to properly test for the influence of bacterial genetics on BSI outcomes, together with an increase in sample size, as suggested by our simulations.
On the other hand, we found a clear association between the pap operon and surrounding genes and the urinary route of entry for BSI. This agrees with an earlier study with a similar sample size that looked specifically at invasive UTI [13]. We can point to a common theme in the genome-wide association studies so far conducted in E. coli infection models: the main associated genetic elements come from fairly frequent (~50% of the population) pathogenicity islands that have been previously described, sometimes decades before the ubiquity of genomic data made GWAS studies feasible [34][35][36][37][38][39]. One can then wonder whether these approaches are likely to ever lead to the discovery of previously undescribed genetic variants able to modulate the establishment of disease and its outcomes. We argue that as genomic sequencing of pathogens is becoming a routine part of clinical or epidemiological practice [40][41][42], we will likely eventually reach very large sample sizes, similar to what is currently available for human GWAS studies [43], and possibly larger, as has been recently shown for SARS-Cov-2 genomic epidemiology efforts [44,45]. Apart from increasing the power to discover genetic variants associated with a phenotype, a large sample size would allow for the discovery of rare or ultrarare variants, which in turn may have a relatively large influence on the phenotype of interest, alone or collectively [23], as has recently been appreciated in the study of human traits and disease [46]. In particular, pooling rare variants with similar effects through burden testing approaches might uncover associations with a relatively common trait such as virulence. Additionally, focusing the analysis on a specific phylogroup known to be associated with a trait might help to boost statistical power, as we have shown when using only isolates belonging to the B2 phylogroup.
In the context of bacterial infection, in which we and others have shown how host factors contribute to a large extent, a further help will likely come from including the host genetic variation into the association. A joint human/bacterial association analysis may however require an even larger sample size in order to account for potential interactions between host and bacterial genetic elements [16]. Taken together, the assumed inevitability of clinical genome sequencing together with careful recording of host and clinical data may eventually lead to comprehensively cataloging the fraction of E. coli genetic variants that influence bloodstream infections.

Dataset
The Colibafi and Septicoli studies were prospective observational cohort studies conducted in tertiary-care teaching hospitals in the Paris area. The Colibafi study was performed in 8 hospitals representing a total of 3,900 adult acute care beds whereas 7 hospitals were included in the Septicoli study, accounting for 5,800 acute care beds. Four hospitals were common between the two studies (i.e. 2,900 acute care beds). All the hospitals belong to the same institution, the "Assistance Publique-Hôpitaux de Paris" network, which accounts for a total of 13,000 adult acute care beds with a homogenous management for most bacterial infections. These hospitals receive about 10 million patients each year. As the Paris area is home to 12 million people (18% of the French population), our study can be considered as representative of the French capital, characterized by a high density of inhabitants and multinational exchanges. Adult patients with E. coli BSI were included; patients receiving vasopressors before the onset of BSI were excluded, as were patients that experienced subsequent BSI episodes during each study's period (i.e. only samples and data from the first episode were included). E. coli BSI was defined as the isolation of E. coli from at least 1 blood culture bottle. Sepsis and septic shock were defined as recommended by the 2016 Third International Consensus Definitions for Sepsis and Septic Shock (Sepsis-3) [47]. Data were prospectively collected by clinicians in each centre on two separate visits: Visit 1 corresponded to the time of BSI (the day the blood culture was drawn; data were collected retrospectively 24-48h hours later, once the blood culture had grown) and Visit 2 corresponded to the day of discharge or in-hospital death (or day 28 if the patient was still hospitalized). For each episode, the first E. coli strain collected in the blood culture was identified. The primary endpoint was vital status at discharge or day 28 (i.e. Visit 2). The likely portal of entry was established according to clinical and/or radiological characteristics of the episodes and the isolation of E. coli from the presumed source of infection. When E. coli could be isolated from the source of infection, the portal of entry was assigned on the basis of firm clinical suspicion. In each centre, an infectious diseases clinician and a microbiologist were in charge of including patients and completing the case report form (see Colibafi and Septicoli groups in the Acknowledgments section). A steering committee was in charge of implementation and a scientific committee responsible for scientific overview.
From the combined dataset we removed those variables with more than 15% of missing values (whether the patient had received a transplant, neutropenia, pregnancy status, body mass index, patient discharge route), and we added a binary variable to record the study provenance of each sample. We imputed the remaining missing values using the MICE package, v3.12.0, using 15 iterations [48]. The raw and imputed combined datasets' summary statistics are available as S1 Table. Univariate and multivariate analysis We tested the association between clinical variables and patient outcome in a similar way as it was done in the original studies [8,10] . Briefly, we first applied a min-max scaler to the age variable to bring it in the [0-1] range. For each patient outcome we then tested each clinical variable using a logistic regression as implemented in the statsmodels package, v0.11.1, using the study provenance as a covariate. We used those variables with association p-value < 0.1 to run a multivariate logistic regression, using a backward stepwise selection method to construct the final model, using the MASS package v7.3_51.3 [49].

Whole genome sequencing and annotation
Bacterial genomes were sequenced using Illumina NextSeq technology as previously described [19]. The genomes from the Colibafi and Septicoli collections are available (Bioproject PRJEB39260 and PRJEB35745, respectively). All genomes were assembled with shovill version 1.0.4 using SPAdes v3.13.1 [50] and standard parameters, and then annotated with Prokka 1.14.5 [51]. A phylogenetic tree was computed from a core genome multiple sequence alignment, as computed by Roary v3.12 [52], using IQ-TREE v1.6.12 [53], under the GTR+F+I+G4 model. The tree was visualized using the iTOL web interface [54]. We collapsed all genes encoded in the sequenced genomes into gene families using panaroo v1.2.4 [55] with default parameters.

Heritability estimates
We estimated narrow-sense heritability for the five target variables, using two different covariance matrices; one built from the phylogroup membership of each strain and another using a kinship matrix built from the unitigs presence and absence matrix derived from the input genomes (see next section). We excluded unitigs present in all samples and sampled 5% of the remaining unitigs. For the latter covariance matrix we also used the same clinical covariates as in the GWAS analysis (see below). We used Limix v3.04 [56], assuming normal errors for the point estimate and we computed the 95% confidence intervals using the ALBI package (commit 90d819e) [57].

Association analyses
We derived unitigs by constructing a compressed de Bruijn graph from the input genomes, using unitig-counter v1.1.0 [22,23]. We computed the distance between each pair of samples by using mash 2.2.2 [58] with a sketch size of 10,000; we used the resulting distance square matrix to compute associations between phylogroups and each target variable, using pyseer v1.3.6 [59]. We tested for locus effects using the unitigs presence/absence vector with the FastLMM [60] linear mixed-model and a kinship matrix derived from the unitigs presence and absence matrix, as described in the previous section, using pyseer v1.3.6 [59]. We run two associations; a "naïve" one that accounted for population structure only, and one additionally conditioning on the clinical variables ("with covariates"). For the three patient outcomes we used all available variables as covariates with the exception of "death", "septic shock" and "admission to ICU", but including the portals of entry, which were excluded when those were the target variables. All the clinical variables used as covariates are described in S1 Table. We determined a significance threshold by counting the number of unique unitigs presence/ absence patterns tested, which reduces the risk of excessively deflating association p-values. We mapped the unitigs passing the significance threshold back to all input genomes and their genes using bwa v0.7.17-r1188 [61] and bedtools v2.30.0 [62,63], using the output of panaroo to assign each unitig to a gene cluster. The unitigs were further filtered to reduce the number of spurious associations: unitigs were excluded if they were shorter than 30bp, if they were mapped to multiple locations in each genome, if they mapped to less than 9 samples (~1% of the total sample size) and if they were mapped to more than 10 different genes across all samples. We further annotated the gene families with mapped unitigs by taking a representative protein sequence from all genomes encoding each gene and using it as an input for eggnogmapper v2.1.3 [64]. The same approach was used to run associations for isolates belonging to the B2 phylogropup.
We tested for the association of rare variants (minimum allele frequency < 1%) by performing a burden test, that is, we performed associations between deleterious rare variants in each gene separately and the five target phenotypes. We derived short variants from each sample against the complete genome of Escherichia coli IAI39-which belongs to phylogroup Fusing snippy v4.6.0 and annotated them using SnpEff v5.0 [65]. We then merged the individual VCF files and filtered for rare variants using bcftools v1.13 [66]. We further filtered the resulting variants according to their annotation: variants annotated as "disruptive", "frameshift", "start codon loss", "stop codon gain", and "stop codon loss"; for missense variants we assessed the likelihood that they were deleterious to protein function using the SIFT algorithm, as implemented in the SIFT4G package v2.0.0 [67], using the uniref50 subset of the Uniprot database [68] (downloaded on June 16, 2021) to construct the multiple sequence alignments. We considered a missense variant to be deleterious if the protein residue had a median information content below 3.25 and score < 0.05. The association was run in a similar way as the one with common unitigs (linear mixed model and clinical covariates) using pyseer v1.3.6 [59]. No significant hit was found with this association method.
Power simulations similar to that estimated in this study. We used the BacGWASim package v2.1.1 [26] to generate both simulated variants and phenotypes. We simulated 10,000 bacterial genomes each 1,000,000 bp long, using a mutation rate of 0.06 and recombination rate of 0.01. We then simulated two binary phenotypes: one with a "high" (0.2) and one with a "low" (0.05) heritability; for both phenotypes we assumed a prevalence of 50% and generated 10 sets of 28 causal variants with minimum allele frequency of 10%. For each batch of simulated phenotypes we ran an association with pyseer v1.3.6 [59] using logistic regression and population structure correction using the first four components of the multidimensional scaling obtained from the samples pairwise distance matrix computed using mash v2.2.2 [58]. Statistical power was computed as the proportion of causal variants that passed the significance threshold, computed by counting the number of unique presence/absence patterns for all tested variants.